II. Description of the data source
The data were collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).
The For-Hire Vehicle (“FHV”) trip records since 2009 until present including fields capturing the dispatching base license number and the pick-up date, time, and taxi zone location ID. We are focusing on the time period from 2018-01-01 to 2018-12-31, so the data comes in the shape of 200+ million observations, and each row contains one trip infomation.
The base license number is matching with different vehicle companies, so that we will join the base-number file to define the vehicle types, and we only focus on Uber, Lyft, Via at this point.
The NYC Taxi Zones map provided by TLC and published to NYC Open Data(https://data.cityofnewyork.us/Transportation/NYC-Taxi-Zones/d3c5-ddgc). This map shows the NYC taxi zones corresponding to the pick up zomes and drop off zones, or location IDs, included in the FHV trip records. The taxi zones are roughly based on NYC Department of City Planning’s Neighborhood Tabulation Areas (NTAs) and are meant to approximate neighborhoods.
The NYC Weather data is provided by National Centers For Environmental Information (https://www.ncdc.noaa.gov/data-access). NCEI is the world’s largest provider of weather and climate data. Land-based, marine, model, radar, weather balloon, satellite, and paleoclimatic are just a few of the types of datasets available. The weather data we are using is collected from NY Central Park Station (USW00094728) from 2018-01-01 to 2018-12-31, which contains daily weather records such as wind, precipitation, snow and snow depth.
Statistics through June 30, 2018:
- 17.2 GB of raw data
- 200+ million for-hire vehicle total trips
- 365 daily weather records
Existing problem:
- R reads entire data set into RAM all at once. Total 17.2 GB of raw data would not fit in local memory at once.
- R Objects live in memory entirely, which cause slowness for data analysis.
- The TLC publishes base trip record data as submitted by the bases, and we cannot guarantee or confirm their accuracy or completeness.
III. Description of data import / cleaning / transformation
3.1 Libraries and Dependencies
library(tibble) # data wrangling
library(reshape2) # data wrangling
library(tidyr) # data wrangling
library(dplyr) # data manipulation
library(purrr) # data manipulation
library(data.table) # data manipulation
library(ggplot2) # visualisation
#library(ggpubr) # visualisation
library(plotly) # visualisation
library(lubridate) # date and time
#library(rgdal) # import GeoJSON
3.2 Data collection
First of all, we write a shell script to download original data from public websites
curl https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-01.csv >201801.csv
curl https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-02.csv >201802.csv
curl https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-03.csv >201803.csv
curl https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-04.csv >201804.csv
curl https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-05.csv >201805.csv
curl https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-06.csv >201806.csv
curl https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-07.csv >201807.csv
curl https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-08.csv >201808.csv
curl https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-09.csv >201809.csv
curl https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-10.csv >201810.csv
curl https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-11.csv >201811.csv
curl https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-12.csv >201812.csv
3.3 Data Import & Cleaning
Then We use data.table:fread function to speed up reading in data for each month, and identify and select the vehicle company as type based on the license number. At the time, we also export subset monthly data into csv file as back up. we bind all monthly data into tibble format to perform our strucuted data. Each row contains trip information such as pick-up, drop-off date, time, location ID.
Due to local memeory issue in R, we process half-year data at one time, and use aggregation technique to compute results and perform entire year analysis. We will explain more detail below.
base = read.csv("./data/orignal/base_number.csv")$x %>% as.character()
load = function(i)
{
# set file path
x = paste("./data/orignal/2018", i, ".csv") %>% gsub(" ", "", .)
# load orginal data
# identify type
# remove useless columns
temp = fread(x) %>%
filter(Dispatching_base_number %in% c("B02510", "B02800", base)) %>%
mutate(type = ifelse(Dispatching_base_number == "B02510", "Lyft",
ifelse(Dispatching_base_number == "B02800", "Via", "Uber"))) %>%
select(-SR_Flag, -Dispatching_base_number, -Dispatching_base_num) %>%
as.tibble()
# export subset monthly data as back-up
a = paste("./data/", i, ".csv") %>% gsub(" ", "", .)
write.csv(temp, a)
return(data)
}
data = c()
for(i in 1:12)
{
# load original monthly data
temp = load(i)
# combine data
data = bind_rows(data, temp)
# optimize memory usage
remove(temp)
}
3.4 File structure and content
Let’s have an overview of the first 5000 Jan and Dec data. We find the time format is different, so we would like to convert to standard time stamp.
Using summary function to deeply understand the data distribution, and find missing value in DOlocationID
3.6 Data Missing & Outliers
We find there is 91,932 missing value in our dataset. More specific, there are 1,941 Lyft records missing pick-up location, and rest of that are completely bad data records. To conclude accurate analysis, we are going to remove all NA records.
data[which(is.na(data)), ]
Also, due to most companies allow to cancel the order in 2 minutes, we find there are 279,693 records showing the trip duration is less than 2 minutes.
data %>% mutate(duration.large.than.2 = duration > 2 ) %>%
group_by(duration.large.than.2) %>%
count
Then, we investage on those pick-up and drop-off location by using heat mapp.
vals = unique(scales::rescale(df$n))
o = order(vals, decreasing = FALSE)
cols = scales::col_numeric("Reds", domain = NULL)(vals)
colz = setNames(data.frame(vals[o], cols[o]), NULL)
plot_ly(data=df, x=~pick_borough,y=~drop_borough,z=~n, colorscale = colz, type = "heatmap")
To conclude, we believe those data are more likely cancelled trip order, so we are going to remove those as well.
3.7 Data Aggregation
By Solving local memory issue in R, since we are interested in the number of trips, and trip duraction, we don’t have to store all data into R. The idea is that we can process half-by-half year data and aggregate into different levels such as hour, weekday, day, and month. Then, we combine aggregated results to make visualization plots, which are much smaller.
data %>%
mutate(hour = hour(pick), wday = weekdays(pick), type) %>%
group_by(hour, wday, type) %>%
count
data %>%
mutate(month = month.abb[month(pick)]) %>%
group_by(month, type) %>%
summarise(d.med = median(duration))
V. Results
y = read.csv("./1 wday duration.csv") %>% mutate(wday = as.character(wday)) %>% as.tibble()
z = read.csv("./2 wday duration.csv") %>% mutate(wday = as.character(wday)) %>% as.tibble()
inner_join(y, z, by = "wday") %>% mutate(d.med = (d.total.x + d.total.y)/(n.x + n.y)) %>% select(wday, d.med) %>% write.csv(., "wday duration.csv")
bind_rows(y, z) %>% select(-X) %>% write.csv("./day duration.csv")
read.csv("hourly duration.csv") %>%
plot_ly(., x = ~ hour, y = ~ d.med, type ="scatter", mode = 'lines+markers',line = list(color="#2E86C1", width = 4), marker = list(size = 8, color = 'rgba(255, 182, 193, .9)', line = list(color = 'rgba(152, 0, 0, .8)', width = 2,simplyfy = F)), text = ~ paste("Hour: ", hour, '<br>Average duration:', round(d.med,2)), main="average duration vs hour") %>%
layout(title = 'Hourly Median Trip Duration(min)', yaxis = list(title = 'Duration', zeroline = FALSE),
xaxis = list(title = 'Hour',zeroline = FALSE))
'scatter' objects don't have these attributes: 'main'
Valid attributes include:
'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'selectedpoints', 'hoverinfo', 'hoverlabel', 'stream', 'transforms', 'uirevision', 'x', 'x0', 'dx', 'y', 'y0', 'dy', 'stackgroup', 'orientation', 'groupnorm', 'stackgaps', 'text', 'hovertext', 'mode', 'hoveron', 'hovertemplate', 'line', 'connectgaps', 'cliponaxis', 'fill', 'fillcolor', 'marker', 'selected', 'unselected', 'textposition', 'textfont', 'r', 't', 'error_x', 'error_y', 'xcalendar', 'ycalendar', 'xaxis', 'yaxis', 'idssrc', 'customdatasrc', 'hoverinfosrc', 'xsrc', 'ysrc', 'textsrc', 'hovertextsrc', 'hovertemplatesrc', 'textpositionsrc', 'rsrc', 'tsrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'
'scatter' objects don't have these attributes: 'main'
Valid attributes include:
'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'selectedpoints', 'hoverinfo', 'hoverlabel', 'stream', 'transforms', 'uirevision', 'x', 'x0', 'dx', 'y', 'y0', 'dy', 'stackgroup', 'orientation', 'groupnorm', 'stackgaps', 'text', 'hovertext', 'mode', 'hoveron', 'hovertemplate', 'line', 'connectgaps', 'cliponaxis', 'fill', 'fillcolor', 'marker', 'selected', 'unselected', 'textposition', 'textfont', 'r', 't', 'error_x', 'error_y', 'xcalendar', 'ycalendar', 'xaxis', 'yaxis', 'idssrc', 'customdatasrc', 'hoverinfosrc', 'xsrc', 'ysrc', 'textsrc', 'hovertextsrc', 'hovertemplatesrc', 'textpositionsrc', 'rsrc', 'tsrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'
lvl = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
read.csv("wday duration.csv") %>% mutate(wday = ordered(wday, levels = lvl)) %>% arrange(wday) %>%
plot_ly(., x = ~ wday, y = ~ d.med, type ="scatter", mode = 'lines+markers', line = list(width=4,simplyfy = F, color='rgb(114, 186, 59)'), marker = list(size = 8, color = '#8E44AD', line = list(color = '#3498DB', width = 2)), text = ~paste("Weekday: ", wday, '<br>Median duration:', round(d.med,2))) %>%
layout(title = 'Weekday Median Trip Duration(min) for FHV types',
yaxis = list(title = 'Duration',zeroline = FALSE),
xaxis = list(title = 'Weekday',zeroline = FALSE))
---
title: "Final Project"
author: "Jie Li"
date: "April 25, 2019"
output:
  html_notebook:
    code_folding: hide
  html_document:
    df_print: paged
---

## I. Introduction
This is a comprehensive Exploratory Data Analysis for billions of for-hire vehicle ("FHV": Uber, Lyft, Via, etc) trips originating in New York City, and we focuse on trip counts and duration in New York competition with tidy R and ggplot2.

The goal of this challenge is to load, process, understand the duration of fhv in NYC based on features: trip location, pick-up and drop-off time. Also, we are interested in the difference betweeen three companies such as market shares, targeted customers, and business strategy. Firstly, we will study and visualize the original data, engineer new features. Then, we add external NYC weather data sets to analyze the impact on the target trip duration. Finally, We compare three companies trips over various time frame on their target trip_duration values.

## II. Description of the data source
The data were collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

The For-Hire Vehicle ("FHV") trip records since 2009 until present including fields capturing the dispatching base license number and the pick-up date, time, and taxi zone location ID. We are focusing on the time period from **2018-01-01** to **2018-12-31**, so the data comes in the shape of 200+ million observations, and each row contains one trip infomation.

The base license number is matching with different vehicle companies, so that we will join the `base-number` file to define the vehicle types, and we only focus on Uber, Lyft, Via at this point.

The NYC Taxi Zones map provided by TLC and published to NYC Open Data(https://data.cityofnewyork.us/Transportation/NYC-Taxi-Zones/d3c5-ddgc). This map shows the NYC taxi zones corresponding to the pick up zomes and drop off zones, or location IDs, included in the FHV trip records. The taxi zones are roughly based on NYC Department of City Planning's Neighborhood Tabulation Areas (NTAs) and are meant to approximate neighborhoods.

The NYC Weather data is provided by National Centers For Environmental Information (https://www.ncdc.noaa.gov/data-access). NCEI is the world's largest provider of weather and climate data. Land-based, marine, model, radar, weather balloon, satellite, and paleoclimatic are just a few of the types of datasets available. The weather data we are using is collected from NY Central Park Station (USW00094728) from **2018-01-01** to **2018-12-31**, which contains daily weather records such as wind, precipitation, snow and snow depth.

Statistics through June 30, 2018:

* 17.2 GB of raw data
* 200+ million for-hire vehicle total trips
* 365 daily weather records

Existing problem:

* R reads entire data set into RAM all at once. Total 17.2 GB of raw data would not fit in local memory at once.
* R Objects live in memory entirely, which cause slowness for data analysis.
* The TLC publishes base trip record data as submitted by the bases, and we cannot guarantee or confirm their accuracy or completeness.


## III. Description of data import / cleaning / transformation
### 3.1 Libraries and Dependencies
```{r, warning=F, message=F}
library(tibble) # data wrangling
library(reshape2) # data wrangling
library(tidyr) # data wrangling
library(dplyr) # data manipulation
library(purrr) # data manipulation
library(data.table) # data manipulation
library(ggplot2) # visualisation
#library(ggpubr) # visualisation
library(plotly) # visualisation
library(lubridate) # date and time
#library(rgdal) # import GeoJSON
```

```{r, echo=F}
setwd("E:/NYC Taxi Project/")
```

### 3.2 Data collection
First of all, we write a `shell` script to download original data from public websites
```{r, eval=F}
curl https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-01.csv >201801.csv
curl https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-02.csv >201802.csv
curl https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-03.csv >201803.csv
curl https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-04.csv >201804.csv
curl https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-05.csv >201805.csv
curl https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-06.csv >201806.csv
curl https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-07.csv >201807.csv
curl https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-08.csv >201808.csv
curl https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-09.csv >201809.csv
curl https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-10.csv >201810.csv
curl https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-11.csv >201811.csv
curl https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-12.csv >201812.csv
```

### 3.3 Data Import & Cleaning
Then We use `data.table:fread` function to speed up reading in data for each month, and identify and select the vehicle company as type based on the license number. At the time, we also export subset monthly data into `csv` file as back up. we bind all monthly data into `tibble` format to perform our strucuted data. Each row contains trip information such as pick-up, drop-off date, time, location ID.

Due to local memeory issue in R, we process half-year data at one time, and use **aggregation** technique to compute results and perform entire year analysis. We will explain more detail below.

```{r, eval=F}
base = read.csv("./data/orignal/base_number.csv")$x %>% as.character()

load = function(i)
{
  # set file path 
  x = paste("./data/orignal/2018", i, ".csv") %>% gsub(" ", "", .)
  
  # load orginal data
  # identify type
  # remove useless columns
  temp = fread(x) %>%
    filter(Dispatching_base_number %in% c("B02510", "B02800", base)) %>%
    mutate(type = ifelse(Dispatching_base_number == "B02510", "Lyft",
                         ifelse(Dispatching_base_number == "B02800", "Via", "Uber"))) %>%
    select(-SR_Flag, -Dispatching_base_number, -Dispatching_base_num) %>%
    as.tibble()
  
  # export subset monthly data as back-up
  a = paste("./data/", i, ".csv") %>% gsub(" ", "", .)
  write.csv(temp, a)

  return(data)
}

data = c()
for(i in 1:12)
{
  # load original monthly data
  temp = load(i)
  
  # combine data
  data = bind_rows(data, temp)
  
  # optimize memory usage
  remove(temp)
}
```

### 3.4 File structure and content
Let's have an overview of the first 5000 `Jan` and `Dec` data. We find the time format is different, so we would like to convert to standard time stamp.

```{r, echo=F}
Jan = read.csv("./data/01.csv") %>% select(-X)
Dec = read.csv("./data/012.csv") %>% select(-X)

Jan
Dec
```

Using `summary` function to deeply understand the data distribution, and find missing value in `DOlocationID`
```{r, echo=F}
read.csv("./summary Jan.csv") %>% select(-X) %>% as.tibble() 
read.csv("./summary Dec.csv") %>% select(-X) %>% as.tibble()
```

### 3.5 Data Transformation
Next, we use `lubridate:ymd_hms` and `lubridate:mdy_hms` transformat string to standard time stamp variables, and calucate the trip duration in **minute** by sbustracting drop-off time and pick-up time. Also, we factorize the company types to save memory usage and furture visualization.

```{r}
# Jan-Nov
Jan %>%
mutate(pick = ymd_hms(Pickup_DateTime), drop = ymd_hms(DropOff_datetime), duration = as.numeric(drop - pick)/60, type = factor(type)) %>%
select(-V1, -Pickup_DateTime, -DropOff_datetime)

# only for Dec
Dec %>%
mutate(pick = mdy_hm(Pickup_DateTime), drop = mdy_hm(DropOff_datetime), duration = as.numeric(drop - pick)/60, type = factor(type)) %>%
select(-V1, -Pickup_DateTime, -DropOff_datetime)
```

### 3.6 Data Missing & Outliers

We find there is 91,932 missing value in our dataset. More specific, there are 1,941 Lyft records missing pick-up location, and rest of that are completely bad data records. To conclude accurate analysis, we are going to remove all `NA` records.
```{r, eval=F}
data[which(is.na(data)), ]
```

```{r, echo=F}
x = read.csv("./missing.csv") %>% as.tibble() %>% mutate(pick = mdy_hm(pick), drop = mdy_hm(drop))
x
tail(x)
```

Also, due to most companies allow to cancel the order in 2 minutes, we find there are 279,693 records showing the trip duration is less than 2 minutes.

```{r, eval=F}
data %>% mutate(duration.large.than.2 = duration > 2 ) %>%
  group_by(duration.large.than.2) %>%
  count
```

```{r, echo=F}
read.csv("./2min_counts.csv") %>% as.tibble()
```


```{r, echo=F}
df = read.csv("./data/less_than_2.csv")
pick = read.csv("./dictionary_pickup.csv") %>% as.tibble()
drop = read.csv("./dictionary_dropoff.csv") %>% as.tibble()

match_zone = function(df,dictionary_pickup,dictionary_dropoff){

  matched.df = df %>% 
   left_join(dictionary_pickup,by="PUlocationID") %>% 
   mutate(pick_borough = borough) %>% 
   select(-borough)%>%
   left_join(dictionary_dropoff,by="DOlocationID") %>% 
   mutate(drop_borough = borough) %>%  
  select(-borough,-type,-pick,-drop) %>% 
   filter(!is.na(drop_borough)) %>% 
   filter(!is.na(pick_borough)) %>% 
  group_by(pick_borough,drop_borough) 

  return(matched.df)
}

df = match_zone(df,pick,drop) %>% count()
```

Then, we investage on those pick-up and drop-off location by using **heat mapp**. 
```{r}
vals = unique(scales::rescale(df$n))
o = order(vals, decreasing = FALSE)
cols = scales::col_numeric("Reds", domain = NULL)(vals)
colz = setNames(data.frame(vals[o], cols[o]), NULL)
plot_ly(data=df, x=~pick_borough,y=~drop_borough,z=~n, colorscale = colz, type = "heatmap")
```

To conclude, we believe those data are more likely cancelled trip order, so we are going to remove those as well.


```{r, echo=F}
read.csv("./2min_counts.csv") %>% as.tibble()
```

### 3.7 Data Aggregation
By Solving local memory issue in R, since we are interested in the number of trips, and trip duraction, we don't have to store all data into R. The idea is that we can process half-by-half year data and aggregate into different levels such as hour, weekday, day, and month. Then, we combine aggregated results to make visualization plots, which are much smaller.

```{r, eval=F}
data %>%
    mutate(hour = hour(pick), wday = weekdays(pick), type) %>%
    group_by(hour, wday, type) %>%
    count
```

```{r, echo=F}
read.csv("counts by h_wk_type.csv") %>% select(-X) %>% as.tibble()
```

```{r, eval=F}
data %>%
    mutate(month = month.abb[month(pick)]) %>%
    group_by(month, type) %>%
    summarise(d.med = median(duration))
```


```{r, echo=F}
read.csv("month type duration.csv") %>% select(-X) %>% as.tibble()
```

## V. Results
```{r, eval=F}
y = read.csv("./1 wday duration.csv") %>% mutate(wday = as.character(wday)) %>% as.tibble()

z = read.csv("./2 wday duration.csv") %>% mutate(wday = as.character(wday)) %>% as.tibble()

inner_join(y, z, by = "wday") %>% mutate(d.med = (d.total.x + d.total.y)/(n.x + n.y)) %>% select(wday, d.med) %>% write.csv(., "wday duration.csv")

bind_rows(y, z) %>% select(-X) %>% write.csv("./day duration.csv")

```

```{r, message=F, warning=F}
read.csv("hourly duration.csv") %>%
plot_ly(., x = ~ hour, y = ~ d.med, type ="scatter", mode = 'lines+markers',line = list(color="#2E86C1", width = 4), marker = list(size = 8, color = 'rgba(255, 182, 193, .9)', line = list(color = 'rgba(152, 0, 0, .8)', width = 2,simplyfy = F)), text = ~ paste("Hour: ", hour, '<br>Average duration:', round(d.med,2)), main="average duration vs hour") %>%
  layout(title = 'Hourly Median Trip Duration(min)', yaxis = list(title = 'Duration', zeroline = FALSE),
         xaxis = list(title = 'Hour',zeroline = FALSE))
```


```{r}
lvl = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

read.csv("wday duration.csv") %>% mutate(wday = ordered(wday, levels = lvl)) %>% arrange(wday) %>%
  plot_ly(., x = ~ wday, y = ~ d.med, type ="scatter", mode = 'lines+markers', line = list(width=4,simplyfy = F, color='rgb(114, 186, 59)'), marker = list(size = 8, color = '#8E44AD', line = list(color = '#3498DB', width = 2)), text = ~paste("Weekday: ", wday, '<br>Median duration:', round(d.med,2))) %>%
  layout(title = 'Weekday Median Trip Duration(min) for FHV types',
         yaxis = list(title = 'Duration',zeroline = FALSE),
         xaxis = list(title = 'Weekday',zeroline = FALSE))
```


